Identification and interpretation of gait analysis features and foot conditions by explainable AI

Clinical gait analysis is a crucial step for identifying foot disorders and planning surgery. Automating this process is essential for efficiently assessing the substantial amount of gait data. In this study, we explored the potential of state-of-the-art machine learning (ML) and explainable artificial intelligence (XAI) algorithms to automate all various steps involved in gait analysis for six specific foot conditions. To address the complexity of gait data, we manually created new features, followed by recursive feature elimination using Support Vector Machines (SVM) and Random Forests (RF) to eliminate low-variance features. SVM, RF, K-nearest Neighbor (KNN), and Logistic Regression (LREGR) were compared for classification, with a Majority Voting (MV) model combining trained models. KNN and MV achieved mean balanced accuracy, recall, precision, and F1 score of 0.87. All models were interpreted using Local Interpretable Model-agnostic Explanation (LIME) method and the five most relevant features were identified for each foot condition. High success scores indicate a strong relationship between selected features and foot conditions, potentially indicating clinical relevance. The proposed ML pipeline, adaptable for other foot conditions, showcases its potential in aiding experts in foot condition identification and planning surgeries.


Retrospective data
The Heidelberg Foot Measurement Method (HFMM) 25 was employed for data collection using a 12-camera VICON motion capturing system (Vicon Motion Systems Ltd.Oxfordshire UK).The gait data collection procedure involved a self-selected speed walking on a seven-meter walkway, and five to seven strides with good data quality were taken into account in each single trial.Vertical ground reaction force was also collected using force plates (Kistler Instruments, Winterthur, Switzerland) to segment gait cycles with respect to foot-off timing points.
The dataset contains 12 functional angles: Tibiotalar flexion, Medial arch inclination, Medial arch angle, Lateral arch angle, Subtalar inversion, Forefoot/ankle supination, Forefoot/midfoot supination, Forefoot/hindfoot abduction, Forefoot/ankle abduction, Inter MT I-V angle, Hallux adduction, Hallux flexion.These are described by Simon et.al. 25 as follows and comprehensive illustrations are available in their published work.Tibiotalar flexion involves the rotational movement between the tibia and talus, indicated by the movement of the calcaneus and navicular, around the malleolar line, primarily in the sagittal plane.Medial arch inclination is the angle formed between the malleolar line and the line perpendicular to the plane of the medial arch, mostly within the frontal plane.Medial arch angle is the 3D absolute angle formed between the line from the medial calcaneus marker to the navicular and the first metatarsal (MT I), mostly within the sagittal plane.Lateral arch angle is the angle between the line from the lateral calcaneus marker to the fifth metatarsal head (MT V head) and MT V, primarily in the sagittal plane.Subtalar inversion is the rotation of the calcaneus around the subtalar axis, primarily within the frontal plane.Forefoot/ankle supination refers to the angle between the metatarsal head line and the malleolar line, mostly within the frontal plane.Forefoot/midfoot supination is the angle between the metatarsal head line and the metatarsal base line, mostly within the frontal plane.Forefoot/hindfoot abduction is the rotation of the metatarsal head line relative to the calcaneus, primarily in the transverse plane.Forefoot/ankle abduction is the rotation of the metatarsal head line relative to the malleoli line, mostly in the transverse plane.Inter MT I-V angle is the absolute angle between the first metatarsal (MT I) and the fifth metatarsal (MT V) when projected into the transverse plane.Hallux adduction is the angle between the first metatarsal (MT I) and the hallux projected into the plane defined by the metatarsal heads and the foot axis, primarily in the transverse plane.Hallux flexion is the rotation between the hallux and the first metatarsal (MT I) around an axis perpendicular to the foot axis, in the plane of the foot axis and metatarsal head line, mostly within the sagittal plane.Each functional angle data was split into strides using foot-off timing points, averaged point-wise across 5-7 strides from one single trial, and normalized to a percentage gait cycle (i.e. between 0 and 100%) to obtain the mean time series in the retrospective dataset.In addition, the standard deviation (std) time series were calculated to provide point-wise variation among strides.Each time series consisted of 101 data points representing the gait cycle from 0 to 100%.Note that, only the data of the affected legs were selected from the dataset.

Feature engineering
The feature extraction and selection steps are denoted in Fig. 1.Two new time series were derived and scalar features were computed from all of the time series as in the study of Wolf et.al. 26 .Subsequently, median imputation was performed to fill the missing values as preserving the data distribution.The data was normalized to rescale the values between 0 and 1.This was done to prevent over-or under-representation of time series with extremely high or low magnitudes compared to the other time series.Subsequently, features with low variance were eliminated by ML-based feature selection algorithms since they might have negligible effect in classification.In the subsections below, these steps are explained in detail.

New Time series derivation
In order to extract further information from the mean time series, two new time series (i.e. the first gradient and difference from normative) were derived similarly to the automated feature assessment workflow of Wolf et al. 26 .The first gradient time series illustrates the changes occurring throughout the gait cycle, while the latter indicates how the time series deviates from the average typical foot (i.e., normative).The first gradient " V " of the mean time series " U " was calculated in a discrete domain according to the Formula (1): where "k" is the data point index within a gait cycle, which is in our case a number between 0 and 100.The difference from the normative time series " D " was calculated according to the Formula (2): For each functional angle, reference normal time series U norm was calculated by averaging the corresponding time series across all subjects with typical feet.
Computing scalar features from the segmented time series significantly decreases the ML input size causing lesser but still representative data while reducing the complexity of the model.The features explained below were separately computed for the stance and swing phases of the gait cycle, except the range of motion feature, which was calculated for the whole gait cycle.For mean and derived time series, the computed scalar features were as follows: Minimum and maximum values and their gait cycle instances, i.e., the temporal position in the gait cycle, referred to as "timing" (x-axis in Fig. 2) for simplicity throughout the rest of the article.For the standard deviation (std) time series, only the maximum value and its corresponding timings were taken into account as features.This decision was based on the rationale that certain foot conditions might induce more pronounced variations in gait patterns across strides and these features will supply this information.As indicated in Table 2, each functional angle was associated with 15 features separately for the stance and swing phases.The total number of features increased to 31 when computed separately for these phases (15 × 2 = 30), with the addition of the range of motion as an extra feature.The total number of extracted features for all functional angles can be determined by multiplying the number of functional angles by the number of features, resulting in 12 × 31 = 372.Furthermore, the dataset incorporated the "Foot_off " scalar value, denoting the time when the foot is off, to distinguish between the stance and swing phases.Two "Foot_off " values were included, one for the mean and one for the standard deviation of it.Consequently, the overall number of features for each subject reached 374.
Figure 2 shows the computed features on an example mean time series "Tibiotalar Flexion of Left Foot for a Hallux Rigidus Subject" and the difference from the normative.

Median imputation and normalization
We employed median imputation to the time series data to fill missing values while preserving the data distribution.Similar to the study on missing longitudinal data 41 , the missing values were filled with the median of the values in the corresponding timing of the other subjects having the same foot condition.The input values were of different scales and had to be normalized between 0 and 1 for simplifying the learning process of the ML algorithms 42 .Moreover, a natural logarithmic transformation was applied to skew the input distribution to the normal distribution.However, no significant improvement through skewing was observed.www.nature.com/scientificreports/ Feature selection Computed scalar features ended up with 374 scalar values per subject.Not just for computational applicability, but also for problem-related specification 43 , the most useful subset of features was selected.Recursive feature elimination with cross-validation algorithm of Scikit-Learn module 44 was used for feature selection.This algorithm was implemented with a Support Vector Machine (SVM) 45 algorithm as an estimator, which was trained recurrently with the newest subset of the features, the least important of which (regarding model coefficients) was then eliminated for the next trial.The feature subset securing the maximum accuracy in the estimator algorithm was then assumed to be the ideal subset.In addition to this algorithm, a Random Forest (RF) 46 model was trained with all computed features and the importance parameter 44 was used to select the most relevant features by eliminating the ones below a threshold.The intersection of the selected relevant feature subsets of these two algorithms was used as the optimum subset for training the classifier models.For this optimum subset, 334 of the 374 features were selected, which was the intersection of the outcomes of the aforementioned algorithms.

Classification for the identification of foot conditions
Figure 1 shows the classification steps.Briefly, the dataset was split for training and testing and median imputation and normalization were applied to each subset, separately.In training, hyperparameters of the ML models were optimized and then models were trained.In testing, the trained models were tested with the split unseen test data.

Dataset splitting
The dataset consists of the selected scalar features of the functional angles (as inputs) and their foot conditions (as outputs).The dataset was split into training and testing groups with a ratio of 0.85 to 0.15 in a stratified way, respectively.The test data was isolated to guarantee that there was no information leakage between the train and test datasets.Yet, the abovementioned preprocesses (median imputation and normalization) was executed for the training and test data separately.Note that the test data was scaled using the scaling factors provided in the scaling of the training data to avoid dimension mismatches and scale impropriety.

Machine learning algorithms
For the classification of the aforementioned foot conditions (see Table 1), 5 conventional ML algorithms were evaluated in multiclass classification strategy and their results were compared: Support Vector Machines (SVM) 45 , Random Forest (RF) 46 , Logistic Regression (LREGR) 47 , K-nearest Neighbor (KNN) 48 , and a majority voting (MV) 44 algorithm were trained separately.The MV model was implemented with a weighted soft voting technique that uses the trained SVM, KNN, RF, and LREGR models and gives the weighted voting outcome of them as the final output.

Hyperparameter tuning
Within the cross-validation, SVM, KNN, and RF models were tuned with a randomized search algorithm before training for finding the best hyperparameters 49 .For the LREGR model, a computationally expensive grid search algorithm was required for ensuring model convergence.The hyperparameters of SVM, RF, LREGR, and KNN are as below: • SVM: Type of kernel, kernel coefficient (gamma), regularization parameter • RF: Number of trees in the forest, split quality criterion, minimum required samples for splitting a node, minimum required samples for being a leaf node, maximum depth of trees, maximum number of features for splitting, existence of bootstrapping • LREGR: Regularization strength, type of solver algorithm, maximum number of iterations • KNN: Type of distance metric, type of weight function, type of algorithm for computing nearest neighbor, size of leaf (for the requiring algorithms) MV weights were optimized through a randomized search algorithm after training the first four models (SVM, KNN, RF, and LREGR).Further details about the mentioned hyperparameters of the models can be found in the documentation of Sklearn Framework for 44 .

Training and testing
The tuned models were trained without a hard limit on the number of iterations.Instead, a stopping criterion was set with a threshold of change in the loss by the value of 1e-4 for two consecutive iterations.3-times repeated leave one out cross-validation was applied, similar to the approach used by Bajpai et.al. 50 .While the first 4 models do not need a special order of training, the MV model has to be trained at last, since it uses the outputs of the first 4 models as inputs.
To quantify the classification success of the models, multiclass averages of balanced accuracy, recall, precision, and F1 scores of the models were calculated using the counts of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP).Table 3 lists the formulas of the measures and their descriptions 51 .The highness of these performance measures altogether, without them being correlated, indicates the success of the classification 52 .These scores were calculated for the isolated test dataset.www.nature.com/scientificreports/

Model interpretation
Before model interpretation, the features were rescaled to reach their original values.The input features of the trained models were evaluated in terms of their relation to the decisions of the models.For this purpose, LIME algorithm was used, which observes the effect of each input feature on the output by doing perturbations on the inputs 8 .
The LIME algorithm assigns weights and intervals to each feature, indicating the extent of influence each feature has on classification output of the model and the interval within which the feature exerts the greatest influence.However, since the intervals of the feature values are more complex to be interpreted by the clinicians, we only considered the weight of the features.We have calculated the mostly weighted 5 features in average for a randomly selected 20 subjects to be classified in each foot condition, for which these 5 features are then defined as the most relevant in classifying them.As from the nature of the LIME algorithm in our case, the magnitude of the weight was decreasing significantly after the first 5 features, which allowed us to be able to interpret the models using them.

Ethics approval and consent to participate
Ethics approval was obtained from Ethical Commission of the Medical Faculty of the University of Heidelberg "Ethikkommission der Med.Fakultät der Universität Heidelberg" S-850/2019.All methods and procedures were carried out in accordance with relevant guidelines and regulations.Contacting patients and asking for their consent for a retrospective analysis of their data for the described research purpose is not done in this study.The number of persons to be contacted would be large, and in some cases the contact would be in relation to medical care that took place several years ago, which means that a corresponding change of address of the persons concerned is not unlikely.The protection of the anonymized biomechanical data is of secondary importance compared to the described research interest, because in particular.a) the data to be evaluated are already available at the research centre and the original collection took place in the context of routine medical care and the data are to be processed here only for research purposes, and b) for the purpose of the research, only persons who were already authorized to inspect personal data on the occasion of routine medical care are allowed to do so, and c) Personal data will not be passed on to external bodies.d) A waiver for the need of informed consent was granted by the Ethical Commission of the Medical Faculty of the University of Heidelberg "Ethikkommission der Med.Fakultät der Universität Heidelberg".

The tuned hyperparameters
The tuned hyperparameters of the aforementioned models are as follows.In the SVM a linear kernel with a regularization parameter (C) of 70, was used.In the RF, the number of trees in the forest was selected as 6984, the GINI impurity criterion was used, the minimum required subjects for splitting a node was selected as 2, and the minimum required subjects for being a leaf node was selected as 1, max depth of trees was limited to 40, the square root of the total feature number was selected as the maximum feature number to be considered and a bootstrapping was implemented.In the LREGR, the inverse of regularization strength (C) was selected as 0.7 and the solver was selected as stochastic average gradient (SAG) algorithm.The LREGR was executed with a maximum iteration number of 5000 for converging the model.In the KNN, the Manhattan distance metric was used for 6 neighbors with uniformed weights and a KDTree algorithm was used for nearest neighbor calculation, for which a leaf size of 10 was used.

Training and classification scores
The learning curves and scalability graphs are shown in Fig. 3 for SVM, RF, LREGR, KNN, and MV respectively.The faded areas show the standard deviations in each.The learning curve graphs show the average accuracy score for training and cross-validation sets.In all graphs, the x-axis corresponds to number of samples, not the number of iterations.When there is sufficient number of samples, training gave the score of 1.0, showing that the input features provided to the classifier models are in general distinctive.Training scores were descended with the increasing number of samples, which suggests that the model is becoming better in generalization with the increasing number of samples.Successful cross-validation scores are attained only when a sufficient number of samples are included.The cross-validation score graphs for models that do not entirely converge suggest that Table 3. Performance measures, their formulas, and definitions (31).

Measure Formula Description
Balanced accuracy   4 below.All of the trained models achieved high scores in every metric, meaning that not only the foot conditions were predicted correctly, but also false prediction was too rare.

Model interpretation outcomes
In Table 5, the most relevant kinematic features obtained using LIME algorithm are listed for each foot condition.In each feature name in Table 5, the following information is to be read: The short name of the functional angle the feature belongs to (as in Table 1), the type of the derived time series if not itself, the segment of the gait cycle to which the feature belongs to.If the standard deviation time series is used instead of the mean time series, it is indicated with a |STD| in the short time series name.The length of the bars corresponds to the number of selections of a feature as one of the most dominant ones.By examining the length of the bars representing the number of subjects for which the feature was relevant, along with the color of the bars corresponding to the model, one can determine per model which features are the most relevant for each foot condition.It is important to note that while this table provides an overall perspective, it is also possible to identify the most relevant features for each individual patient.This information could be valuable in the diagnosis and treatment for each patient.
Reading example of Table 5: In the classification of Tibiotalar osteoarthritis + partial ankle replacement, minimum medial arch angle during both stance and swing phases (the longest bars in the table) were found as the most relevant feature by RF, KNN, and MV models.The other relevant features are to be seen in the table.In the similar manner, the classification of Cavovarus, the timing of minimum first gradient of tibiotalar flexion in stance phase was found the most relevant feature by LREG, SVM, and MV models.

Discussion
In this study, it is aimed to automate the feature selection and foot condition identification to facilitate the diagnosis of foot conditions and hence, to support experts and physicians in their clinical assessment.An ML pipeline is achieved using the state-of-the-art feature elimination and ML algorithms.LIME is utilized to interpret the outcomes of the ML models so that the experts using this pipeline have feedback about how ML models predict the foot conditions.
Our findings indicate that all the algorithms showed good prediction with a minimum of 0.82 F-1 scores.The success scores of the algorithms are similar but KNN and MV yield the highest scores.When the dataset has a class imbalance, namely some classes have very low data compared to the others, the ML models in general could tend to learn more about the classes with high data.RF is known to overcome class imbalance inherently, however, the test scores in Table 4 indicate that other models also perform well to deal with the class imbalance.
High success scores indicate mathematically strong relationship between the selected features and foot conditions.Moreover, different mathematical features can be computed manually and easily added to the created models.Additionally, automatic feature generation or even featureless classification methods can be explored under the condition that the generated features align with the kinematic characteristics of the joints to ensure interpretability.
Note that the data size is comparably small (i.e.348 subject data).The models will improve, as the new data arrives or new features are added.Since MV utilizes the other ML models, it will likely have the highest score when there is new data.Another limitation of the data is that averaging across strides may induce nongait features, especially if the variability among the strides is significant.This variability is reduced by asking participants to walk in their self-walking speed for a certain time before starting the measurement on a walkway to reach a fairly stable walking condition.It was shown by some studies that the gait cycle becomes significantly reliable after certain walking strides 53,54 .Upon achieving a stable walking condition, stride variability becomes a valuable measure, as it can signify the underlying foot condition.Consequently, the study includes standard deviation values to capture this aspect.Furthermore, this type of averaging has been performed with patients who have foot disorders, ambulation, or amputation 26,[55][56][57][58] .However, analyzing stride data directly; i.e. without averaging, could provide deeper insights across strides and enhance the feature set.This approach expands the dataset size by five to seven times.For these reasons, non-averaged data will be evaluated in future studies.
Furthermore, the model can be extended to classify other functional foot conditions that are not included in this study.Progress in digital patient records of specialized treatment centers will facilitate and strengthen the assistance opportunities of ML and XAI in orthopedic diagnostic decision-making.The trained models do not require high computational resources for making the classification.Therefore, they can be easily integrated into any type of software (e.g.web-based) in the future.www.nature.com/scientificreports/Another potential improvement involves leveraging deep learning to enhance model performances, while preserving interpretability.To achieve this, rather than relying solely on time series data, the focus should still be on crafting models that utilize manually extracted features as inputs.This is essential for maintaining relevance in the XAI outcomes regarding the clinicians' understanding.Although deep learning methods typically need a larger dataset than traditional ML methods and neural networks, few-shot learning techniques could be used to train a model with a limited amount of data.Exploring this approach will be a focus of our future research.The ML methods generate black-box models, meaning that how these models make a prediction is not transparent and very difficult to interpret.It is a common problem in neural network-based models, as well and there are new methods to interpret the resulting predictions of the models.In this study, LIME was employed for the interpretation of the model predictions by showing the most relevant features in the prediction.Furthermore, if the same features are selected by most of the models, it can be stated that those features might have strong relevance in the classification.Through computational analysis, Wang et.al. conducted a study on the biomechanical effects of total ankle replacement (total ankle arthroplasty) and ankle arthrodesis during the stance phase, specifically on the inner foot.Their findings revealed that the deviations in ankle join motion are compensated by angular displacement in the fore-foot, resulting in different force transmission among segments, joint contact pressure, and bone stress distribution compared to intact foot 59 .Tibiotalar osteoarthritis + partial ankle replacement might have similar effects.Considering the LIME results, it can be said that these effects exert a significant influence on the medial arch in both stance and swing phases.In cases of Planovalgus and Cavovarus, the majority of relevant Table 5.The most relevant five classifier features for each foot condition.Colors representing the models: Green, dark blue, orange, light blue, and yellow for RF, KNN, LREG, SVM, and MV, respectively.The bar length is related to the number of subjects for which the feature was relevant.www.nature.com/scientificreports/features are associated with medial arch and tibiotalar flexion.This alignment with these foot conditions is logical since compared to typical feet, Planovalgus and Cavovarus are characterized by lower and higher medial arch, respectively.For Clubfoot, the most relevant features pertain to foot abduction and supination, which is again logical since in this foot condition, the foot is turned inward.Upon evaluating the LIME results across different algorithms, it becomes evident that particular algorithms work better to identify relevant features associated with specific foot condition.Remarkably, in cases of Planovalgus and Clubfoot, certain features are exclusively found as relevant by an individual algorithm (RF, KNN, LREG, or SVM).For instance, in the case of Clubfoot, min of forefoot/ankle abduction in swing phase was identified only by RF and MV.The employment of MV allows us to harness the distinctive strengths of all the algorithms within a unified MV model.This is reflected in the LIME results, where MV emerges as the most effective method for identifying nearly all of the relevant features.It is possible that there could be a relation between operation of the ML model that exclusively identified relevant features across the patients and the gait characteristics of the foot for a specific foot condition as well as the degree of dispersion of features among patients.This point will be considered in future studies.
Our study is pioneering the generation of new insights for clinicians, potentially paving the way for future research driven solely by clinical priorities and concerns.Note that for a comprehensive understanding of the clinical relevance of these findings, the LIME results will be discussed with the experts and physicians for the validation of the gained insights.The LIME results may help them discover unknown features that might be critical in a successful inspection of the foot conditions.

Conclusions
We built an ML pipeline including feature selection, classification, and an interpretability steps.In the feature selection, we extracted features manually and eliminate the ones with low variance.We compared the ML models and showed their ability in the classification of the foot conditions that might reduce clinicians' effort on time-consuming evaluation of the gait data.The findings of the explainable AI method (i.e.LIME) indicate that it can help to understand the reason behind the classification results, which might give insight in mining new knowledge.

Figure 2 .
Figure 2. Sample features computed from an averaged time series normalized to a gait cycle.The orange line shows the tibiotalar flexion of the left foot for a subject with hallux rigidus.The blue line and band show the mean value and variation of the reference typical feet data.Red and green circles show the maximum and minimum values on the data (orange line) and their corresponding timing values on the x-axis, respectively."Difference from normative" is calculated as a feature by averaging the differences denoted by the vertical black lines. https://doi.org/10.1038/s41598-024-56656-4

2 *
condition effectiveness of a classifier model, calculated for each foot condition by weighting the prevalence of it Recall TP TP+FN Effectiveness of a classifier model to identify foot conditions if calculated from sums of per-subject decisions Precision TP TP+FP An average per-foot condition agreement of the real foot conditions of subjects with those predicted by the classifier model F1 score Precision * Recall Precision+Recall Relations between positive labels of data and those given by the classifier model, based on sums of per-foot condition decisions Vol:.(1234567890)Scientific Reports | (2024) 14:5998 | https://doi.org/10.1038/s41598-024-56656-4www.nature.com/scientificreports/additional data could lead to more successful model training.The gap between the training and cross-validation curves indicates a little yet natural overfitting that may decrease as the data size becomes bigger with the new

Figure 3 .
Figure 3. Training graphs of the models.The learning curves and scalability of the model graphs for (a) SVM, (b) RF, (c) LREGR, (d) KNN, (e) MV.X-axis denotes the included number of samples while y-axis refers to the average accuracy (score) and model training times (fit times) in the learning curves and scalability of the model graphs, respectively. https://doi.org/10.1038/s41598-024-56656-4

Table 1 .
Foot conditions, number of subjects, age, height, and weight.
Feature computationEach time series was segmented to gait phases (stance and swing) since segmenting enhances the physiological relevance of the peak values and other computed features for each gait phase, as suggested byWolf et al.

Table 2 .
Computed features for the time series.

Table 4 .
Test scores for each model.Peak scores are in bold.